73 research outputs found

Window-Slicing Techniques Extended to Spanning-Event Streams

Author: Raschia Guillaume
Tassetti Damien
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 27th International Symposium on Temporal Representation and Reasoning (TIME 2020)
Publication date: 01/01/2020

Streaming systems often use slices to share computation costs among overlapping windows. However, they are limited to instantaneous events, where a single point in time represents each event. Here, we extend streams to events that come with a duration, denoted as spanning events. After a short review of the new constraints that event lifespans introduce in a temporal sliding-window context, we propose a new structure for dealing with slices in such an environment, and prove that our technique is both correct and effective for handling such spanning events.

Dagstuhl Research Online Publication Server
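
As a rough illustration of the slice sharing mentioned in the abstract above, the sketch below covers only the baseline case of instantaneous (point) events, not the spanning-event structure the paper proposes. The function name `sliding_sums`, the choice of a sum aggregate, integer timestamps, and windows of the form [k*slide, k*slide + size) are all assumptions made for this example.

```python
from collections import defaultdict
from math import gcd


def sliding_sums(events, size, slide):
    """Sum of values per sliding window, computed from shared slices.

    events: iterable of (timestamp, value) pairs with integer timestamps >= 0.
    Windows are [k*slide, k*slide + size) for k = 0, 1, 2, ...
    Hypothetical helper for illustration; point events only.
    """
    width = gcd(size, slide)        # slice width: every window edge is a slice edge
    partial = defaultdict(int)      # slice index -> partial sum of that slice
    for ts, value in events:
        partial[ts // width] += value   # each event is aggregated exactly once

    if not partial:
        return {}

    last_slice = max(partial)
    slices_per_window = size // width
    results = {}
    start = 0
    while start // width <= last_slice:
        first = start // width
        # A window result is assembled from precomputed slice sums.
        results[start] = sum(
            partial.get(i, 0) for i in range(first, first + slices_per_window)
        )
        start += slide
    return results


if __name__ == "__main__":
    stream = [(0, 1), (1, 2), (3, 4), (5, 8)]
    print(sliding_sums(stream, size=4, slide=2))
```

Each event contributes to one slice only, and overlapping windows reuse the slice sums instead of rescanning the events. An event with a duration could overlap several slices, which is the case the abstract says needs a new structure.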

An algebraic approach to ensemble clustering

Author: Dumonceaux Frédéric
Gelgon Marc
Raschia Guillaume
Publication venue: HAL CCSD
Publication date: 01/08/2014

In clustering, consensus clustering aims at providing a single partition fitting a consensus from a set of independently generated partitions. Common procedures, which are mainly statistical and graph-based, are recognized for their robustness and ability to scale up. In this paper, we provide a complementary and original viewpoint on consensus clustering, by means of algebraic definitions that make it possible to ascertain the nature of available inferences in a systematic approach (e.g. a knowledge base). We base our approach on the lattice of partitions, and show how some operators can be added in order to express a formula representing the consensus. We show that adopting an incremental approach may help retain a significant amount of aggregate data that fits well with the set of input clusterings. Beyond that ability to model formulae, we also note that its potential cannot be easily captured through such a logical system. This is due to the volatile nature of handling partitions, which ultimately affects the ability to draw valuable conclusions.
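
The abstract does not spell out the operators the authors add to the lattice of partitions, so the sketch below shows only the standard meet of that lattice (the coarsest common refinement), as a hedged illustration of the kind of object involved. The function name `meet` and the encoding of a partition as a list of cluster labels, one per item, are assumptions made for this example.

```python
def meet(*labelings):
    """Coarsest common refinement of several partitions of the same items.

    Each labeling is a sequence of cluster labels, one label per item.
    Two items share a block of the result only if every input partition
    puts them in the same cluster.
    """
    blocks = {}
    for item, key in enumerate(zip(*labelings)):
        # key is the tuple of labels this item receives across all inputs
        blocks.setdefault(key, []).append(item)
    return sorted(blocks.values())


if __name__ == "__main__":
    p1 = [0, 0, 1, 1]   # clusters {0, 1} and {2, 3}
    p2 = [0, 1, 1, 1]   # clusters {0} and {1, 2, 3}
    print(meet(p1, p2))
```

The meet keeps only the groupings on which all input clusterings agree, which is one conservative notion of consensus; the paper's own formulae may combine operators differently.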

Summary Management in P2P Systems

Author: Hayek Rabab
Mouaddib Noureddine
Raschia Guillaume
Valduriez Patrick
Publication venue: HAL CCSD
Publication date: 01/03/2008

Sharing huge, massively distributed databases in P2P systems is inherently difficult. As the amount of stored data increases, data localization techniques are no longer sufficient. A practical approach is to rely on compact database summaries rather than raw database records, whose access is costly in large P2P systems. In this paper, we consider summaries that are synthetic, multidimensional views with two main virtues. First, they can be directly queried and used to approximately answer a query without exploring the original data. Second, as semantic indexes, they support locating relevant nodes based on data content. Our main contribution is to define a summary model for P2P systems, and the appropriate algorithms for summary management. Our performance evaluation shows that the cost of query routing is minimized, while incurring a low cost of summary maintenance.

Decision Support to Crowdsourcing for Annotation and Transcription of Ancient Documents: The RECITAL Workshop

Author: Aubert Olivier
Hervy Benjamin
Raschia Guillaume
Rubellin Françoise
Publication date: 30/05/2023

In the 18th century in Paris, only two public theatres could officially perform comedies: the Comédie-Française and the Comédie-Italienne. The latter was much less well known. By studying a century of accounting registers, we aim to learn more about its successful plays, its actors, musicians, set designers, and all the small trades necessary for its operation, its administration, logistics and finances. To this end, we employ a mass of untapped and unpublished resources, the 27,544 pages of 63 daily registers available at the Bibliothèque Nationale de France (BnF). We also take a decidedly fresh look at emerging forms of creation and changes in the entertainment economy. We developed the crowdsourcing platform RECITAL to collect and index the data from the registers, following an emerging trend in Digital Humanities. RECITAL is built upon the ScribeAPI framework and offers a fully-fledged web application to classify the pages, annotate them with marks and tags, transcribe the indexed marks, and even verify the previous transcripts. We also describe a multi-level data model and develop a series of monitoring and decision tools to support crowdsourced data management up to their definitive form.

arXiv.org e-Print Archive

User and Usage Profiling in a Multi-platform Service Environment

Author: Aghasaryan Armen
Betge-Bresetz Stephane
Gelgon Marc
Raschia Guillaume
Publication date: 15/04/2011

University of Hildesheim

Joining Distributed Database Summaries

Author: Bechchi Mounir
Mouaddib Noureddine
Raschia Guillaume
Publication venue: HAL CCSD
Publication date: 01/01/2008

The database summarization system coined SaintEtiQ provides multi-level summaries of tabular data stored in a centralized database. Summaries are computed online with a conceptual hierarchical clustering algorithm. However, in many companies, data are distributed among several sites, either homogeneously (i.e., sites contain data for a common set of features) or heterogeneously (i.e., sites contain data for different features). Consequently, the current centralized version of SaintEtiQ is either not feasible or not desirable due to privacy or resource issues. In this paper, we propose two new algorithms for summarizing heterogeneously distributed data without a prior "unification" of the data sources: Subspace-Oriented Join Algorithm (SOJA) and Tree Alignement-based Join Algorithm (TAJA). The main idea of these algorithms is to apply innovative joins to two local models, computed over two disjoint sets of features, to provide a global summary over the full feature set without scanning the raw data. SOJA takes one of the two input trees as the base model and processes the other one to complete the first, whereas TAJA rearranges summaries by levels in a top-down manner. Then, we propose a consistent quality measure to quantify how good our joined hierarchies are. Finally, an experimental study using synthetic data sets shows that our joining processes (SOJA and TAJA) result in high-quality clustering schemas of the entire distributed data and are very efficient in terms of computational time w.r.t. the centralized approach.

INRIA a CCSD electronic archive server

Cluster-based Search Technique for P2P Systems

Author: Hayek Rabab
Raschia Guillaume
Valduriez Patrick
Publication venue: HAL CCSD
Publication date: 01/01/2008

We consider network clustering as a way to improve the performance of locating data in unstructured P2P systems. Connectivity-based Distributed node Clustering (CDC) and SCM-based Distributed Clustering (SDC) are two major protocols that partition a network topology into clusters based on node connectivity. These protocols focus on the accuracy of the clustering scheme, measured with the Scale Coverage Measure (SCM), and on its maintenance against node dynamicity. However, they do not propose search techniques that take advantage of their clustering information. Thus, these proposals have not been evaluated against the motivation behind them. In this work, we propose a new, efficient Cluster-based Search Technique (CBST) for unstructured P2P systems. We use it to validate connectivity-based clustering schemes according to the trade-off between the cost of maintaining clusters and the benefit for query processing. Our experimental results show the efficiency of CBST implemented over the SDC protocol. By simply exploiting clustering features of the underlying network, a query can travel across a large number of nodes with a minimum number of messages. CBST eliminates a large portion of redundant messages, thus avoiding overloading the P2P network.

INRIA a CCSD electronic archive server

Design of PeerSum: a Summary Service for P2P Applications

Author: Hayek Rabab
Mouaddib Noureddine
Raschia Guillaume
Valduriez Patrick
Publication venue: HAL CCSD
Publication date: 02/04/2007

Sharing huge databases in distributed systems is inherently difficult. As the amount of stored data increases, data localization techniques are no longer sufficient. A more efficient approach is to rely on compact database summaries rather than raw database records, whose access is costly in large distributed systems. In this paper, we propose PeerSum, a new service for managing summaries over shared data in large P2P and Grid applications. Our summaries are synthetic, multidimensional views with two main virtues. First, they can be directly queried and used to approximately answer a query without exploring the original data. Second, as semantic indexes, they support locating relevant nodes based on data content. Our main contribution is to define a summary model for P2P systems, and the algorithms for summary management. Our performance evaluation shows that the cost of query routing is minimized, while incurring a low cost of summary maintenance.

INRIA a CCSD electronic archive server

Peersum : Gestion des résumés de données dans les systèmes P2P

Author: Hayek Rabab
Mouaddib Noureddine
Raschia Guillaume
Valduriez Patrick
Publication venue: HAL CCSD
Publication date: 01/11/2007

Base de Données Avancées (BDA). Sharing huge, massively distributed databases in P2P systems is inherently difficult. As the amount of stored data increases, data localization techniques are no longer sufficient. A practical approach is to rely on compact database summaries rather than raw database records, whose access is costly in large P2P systems. In this paper, we consider summaries that are synthetic, multidimensional views with two main virtues. First, they can be directly queried and used to approximately answer a query without exploring the original data. Second, as semantic indexes, they support locating relevant nodes based on data content. The main contribution of this paper is to define an efficient algorithm for partitioning an unstructured P2P network into domains, in order to optimally distribute summaries in the network. Then, we propose a distributed algorithm for maintaining a summary in a given domain. Our performance evaluation shows that the cost of query routing is minimized, while incurring a low cost of summary maintenance.

INRIA a CCSD electronic archive server